An Algebraic Approach to Data Quality Metrics for Entity Resolution Over Large Datasets

Authors

  • John Talburt
  • Richard Wang
Abstract

This chapter introduces abstract algebra as a means of understanding and creating data quality metrics for entity resolution, the process in which records determined to represent the same real-world entity are successively located and merged. Entity resolution is a particular form of data mining that is foundational to a number of applications in both industry and government. Examples include commercial customer recognition systems and information sharing on "persons of interest" across federal intelligence agencies. Despite the importance of these applications, most of the data quality literature focuses on measuring the intrinsic quality of individual records rather than the quality of record grouping or integration. In this chapter, the authors describe current research into the creation and validation of quality metrics for entity resolution, primarily in the context of customer recognition systems. The approach is based on an algebraic view of the system as creating a partition of a set of entity records based on the indicative information for the entities in question. In this view, the relative quality of entity identification between two systems can be measured in terms of the similarity between the partitions they produce. The authors discuss the difficulty of applying statistical cluster analysis to this problem when the datasets are large and propose an alternative index suitable for these situations. They also report some preliminary experimental results and outline areas for, and approaches to, further research.
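The partition view described in the abstract can be illustrated with a small sketch. The abstract does not state the formula of the proposed index, so the code below is an assumption: it uses an overlap-count formulation, sqrt(|A| * |B|) / |V|, where |A| and |B| are the numbers of groups produced by the two systems and |V| is the number of distinct non-empty overlaps between their groups. The function name `overlap_index` and the dictionary input format are hypothetical.

```python
from math import sqrt


def overlap_index(labels_a, labels_b):
    """Similarity of two partitions of the same record set.

    labels_a, labels_b: dicts mapping record id -> group label, one per
    entity resolution system (hypothetical input format).
    Assumed formula: sqrt(|A| * |B|) / |V|, where |V| counts the distinct
    non-empty overlaps between groups of A and groups of B.
    """
    if labels_a.keys() != labels_b.keys():
        raise ValueError("both systems must label the same set of records")
    if not labels_a:
        raise ValueError("record set must not be empty")

    n_groups_a = len(set(labels_a.values()))
    n_groups_b = len(set(labels_b.values()))

    # Each record falls in exactly one group of A and one group of B, so the
    # distinct (group_a, group_b) pairs are exactly the non-empty overlaps.
    overlaps = {(labels_a[rec], labels_b[rec]) for rec in labels_a}

    return sqrt(n_groups_a * n_groups_b) / len(overlaps)


# Two systems grouping the same four customer records.
system_a = {1: "x", 2: "x", 3: "y", 4: "z"}
system_b = {1: "p", 2: "q", 3: "q", 4: "r"}

print(overlap_index(system_a, system_a))  # identical partitions
print(overlap_index(system_a, system_b))  # partially agreeing partitions
```

Under this assumed formulation the value is 1 when the two partitions are identical and decreases as the groups of one system are split across more groups of the other. It needs only a single pass over the records rather than a comparison of all record pairs, which is the kind of property that matters for large datasets.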


Similar resources

The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution

This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...


Object Level Strategy for Spectral Quality Assessment of High Resolution Pan-sharpen Images

Panchromatic and multi-spectral images produced by remote sensing satellites are fused together to provide a multi-spectral image that also has a high spatial resolution. The spectral quality of the fused images is very important because the quality of a large number of remote sensing products depends on it. Due to the importance of the spectral quality of the fused images, its eval...


Improvement of Chemical Named Entity Recognition through Sentence-based Random Under-sampling and Classifier Combination

Chemical Named Entity Recognition (NER) is the basic step for subsequent information extraction tasks such as named entity resolution, drug-drug interaction discovery, and extraction of the names of molecules and their properties. Improvement in the performance of such systems may affect the quality of the subsequent tasks. Chemical text from which data for named entity recognition is extracte...


Water Quality Restoration Using Landscape Metrics Analysis: A Case Study in the Golestan Province of Iran

The results of an integrated study aimed at restoring water quality in a large watershed including seven catchments in north east Iran are presented in this paper. This case study demonstrates how landscape metrics reflect direct or surrogate causes of the land use practices that are the determinants of water quality parameters. Water quality factors included EC, pH, Cl-1, HCO3-1, SO4-...


Performance Bounds for Pairwise Entity Resolution

One significant challenge to scaling entity resolution algorithms to massive datasets is understanding how performance changes after moving beyond the realm of small, manually labeled reference datasets. Unlike in traditional machine learning tasks, when an entity resolution algorithm performs well on small hold-out datasets, there is no guarantee this performance holds on larger hold-out datasets....




Publication date: 2016